Assignment #6

Chicago Food Inspections - NoSQL

Author: Atef Bader Last Edit: 11/26/2018 Modified: 25 May 2019 sd version b Added sample code for df exploration Added sample code for nosql query exploration Modified: 9 Nov 2021 sd 70 points rename to Assignment 6

Deliverables:

Objectives:

In this assignment, you will:

Assignment Description and Requirement Specifications

Chicago Food Inspections

Recent watchdog report published by Chicago Tribune indcated that food safety inspectors overlook hundreds of day cares in the city of Chicago.

image.png

The key take away from the Chicago Tribune watchdog report is that the city had only 33 working field inspectors to cover the entire city of Chicago. Many of the facilties serve food for Children, and while few fail inspectionns, many escape routine inspections.

This is a classic resource allocation problem. In this assignment, our goal is to identify the hot-spots (areas that have facilities serving food to children and have failed inspections in the past) on the Chicago map to dispatch inspectors to.

To achive our goal, we need the following:

  1. Dataset for Chicago Food Inspections
  2. NoSQL database Egnine (ElasticSearch) for indexing and data retrieval
  3. HeatMap to plot the children facilties that failed Chicago Food Inspections

The CSV file for dataset of the city of chicago is obtained from the data portal for the city of Chicago. Here th elink for the city of Chicago data portal City of Chicago Data Portal

image.png

Loading the Dataset CSV file

Lets load the CSV file into a DataFrame object and see the nature of the data that we have.

Description of the dataset:

  1. It has 164953 inspection records
  2. It has inspection records from 2010 to 2018
  3. It has 17 fields

There are few fields in the dataset of interest for us:

  1. Risk
  2. Results
  3. Latitude
  4. Longtitude
  5. Inspection ID

We are also interested in any field that mentioned (or misspelled) the word Children

There are possibilities that the data entry clerk might've made some typos and misspellings and there are different words meant to indicate the same thing, some examples of this:

To perform different queries to retrieve the relevant inspection records, we will store the dataset in a NoSQL database engine ElasticSearch.

For more information on elastic search visit ElasticSearch

Please note that in this version of the assignment, the index for Chicago food inspections dataset already created on ElasticSearch on DSCC

ElasticSearch

elasticsearch package

We need elasticsearch package to connect to ElasticSearch Servers

To install elastic search pakage, execute following command from the command/terminal windows:

Before running the cell below, connect to NW via VPN

To learn more about loading and indexing the Inspection Records into ElasticSearch

Inspection records are inserted into ElasticSearch engine using the bulk Api of elastic search.

Here is the link API DOCS for the API documentation.

Query is used to retieve data from ElasticSearch server

The query is used to retrieve data from ElasticSearch servers that match certain filters.

For information about the syntax and semantics for query, you can read the docs at the following URL QUERY DOCS

We will also use the scroll to retrive the data matching the our query. For more information about scroll, you can read the docs ta the following URL Scroll DOCS

We create our query to rertieve the inspections records we are interested in three experiements and will compare the results for each:

  1. Experiment #1: Using Regular Expressions using the term Children
  2. Experiment #2: Using Fuziness using the term Children's
  3. Experiment #3: Using Fuziness using the term Children

Boolean Query

This is a query that will match documents based upon the matching combinations of queries. The bool query maps to Lucene BooleanQuery. It is built using one or more boolean clauses, each clause with a typed occurrence. The occurrence types are:

Order Description
must The clause (query) must appear in matching documents and will contribute to the score.
filter The clause (query) must appear in matching documents. However unlike must the score of the query will be ignored.
should The clause (query) should appear in the matching document.
must_not The clause (query) must not appear in the matching documents. Clauses are executed in filter context meaning that scoring is ignored and clauses are considered for caching. Because scoring is ignored, a score of 0 for all documents is returned.


The bool query takes a more-matches-is-better approach which means the score from each matching must or should will be added together to provide the final _score for each document.

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/7.6/query-dsl-bool-query.html

Query String Query

Returns documents based on a provided query string, using a parser with a strict syntax.

This query uses a syntax to parse and split the provided query string based on operators, such as AND or NOT. The query then analyzes each split text independently before returning matching documents.

You can use the query_string query to create a complex search that includes wildcard characters, searches across multiple fields, and more. While versatile, the query is strict and returns an error if the query string includes any invalid syntax.

Example

{"query_string": {
    "query": "Children",  
    "fields": ["Facility Type","Violations","DBA Name"] #Multi-field matching query
             }

https://www.elastic.co/guide/en/elasticsearch/reference/7.x/query-dsl-query-string-query.html#query-string-syntax

Experiment #1: Create the query using regex

Process the retrieved documents and filter fields we need for the Heatmap

We need to create a list-of-lists of the two fields, (Latitude and Longitude) for the HeatMap

We need to install folium package to plot the Map and Heatmaps

The official documentation can be accessed at this URL: Folium

To install Folium package execute following command from the Command/Terminal window:

For the different configuration paramteres for HeatMap, you can access the docs at this URL: HeatMap

Create the HeatMap

Create the query using fuzziness

Now lets try to retrieve documents using ElasticSearch fuzziness

The fuzzy query generates all possible matching terms that are within the maximum edit distance specified in fuzziness.

For information about the syntax and semantics for fuziness, you can read the docs at the following URL fuzziness

Experiment #2: We will first build our query with the parameters:

  1. "query": "Children~",
  2. "fuzziness": "1",

Experiment #3: Lets now build our query with the parameters:

  1. "query": "Children's~",
  2. "fuzziness": "1",

Frequent Violators:

Despite the fact that the city of Chicago has the department of Business Affairs and Consumer Protection to revoke business licensses to protect consumers, it appears many businesses with frequent violations have obtained new licenses under the same DBA name

image.png

Experiment #4: Lets get the top list of frequent violators:

Facilities that serve children can be classified under different Facility Types:

  1. Daycare Above and Under 2 Years
  2. Children's Services Facility
  3. Daycare (2 - 6 Years)

We will use ELasticSearch and Folium to plot on the map those facilities that failed inspection at least 5 times with risk high.

image.png

https://www.elastic.co/guide/en/elasticsearch/reference/7.6/search-aggregations.html

Loopholes

As you might have guessed by now, it must be really cheap to do so, those frequent violators reobtain business license multiple times under the same business name for only $165 application fee based on the official numbers published on the City of Chicago - Business Licensing

And it appears the city of Chicago is willinig to rubber-stamp the approval of the application for only $165, rather than imposing the very simple rule: ( 3 strikes and you are out )

image.png

Requirements

Requirement #1: (10 points)

Provide your comparative analysis for the results obtained from Experiments 1, 2 and 3 above. How do the three different search methods differ in results?

Experiement 1 query with the term "Children" using wildcard matching with regular expression reulted in 601 hits. Experiment 2 with the term "Children" and fuzziness mathing set to 1 resulted in 196 hits. And Experiement 3 with term "Children's" and fuziness matching set to 1 resulted in 505 hits. Experiment 1 returned the most results which could lead to a decrease in precision but an increase in recall. Experiment 2 and 3 produced more narrow queries which could increase the precision, but may decrease the recall.

Requirement #2: (15 points)

Rerun Experiments #1, #2, #3 but searching for "Child" matches instead of "Children". It is optional to produce the heatmaps for each of the experiments, but you will need to show the code and results of each query. Add as many cells as needed for your code.

In a markdown cell, explain how these 3 queries differ to each other and how they differ to the "Children' queries.

The frist query uses regular expression matching without fuzziness to match the word "Child". This query returns 774 hits. The second query matches the word "Child" with fuzziness set to 1 and results in 300 hits. The last query matches "Child's" with fuzziness set 1 and returns with only 24 hits.

Compared to the first set of queries for the word "Children", the "Child" query returns more hits without fuzziness. This could result in a decrease in precision as the total number of doucments retrieved are greater. However, more relevant documents may be retrieved leading to an increase in recall. When applying a fuzziness matching set to 1, the query for "Child" prduces more results than "Chilredn. Again potentially decreasing presicion and increasing recall. The "Child's" query results in less hits than the "Children's" query in Experiment 3 which could increase precision and decrease recall.

Requirement #3: (15 points)

In Experiment #4 we have obtained the list of frequent violators. Using the df_top_frequent_violators dataframe produce a table that shows DBA Name, number of violations and number of licenses issued for every DBA Name.

Requirement #4: (15 points)

Plot on the Heatmap those frequent violators who have obtained 3 business licenses or more under the same DBA Name through out the liftime of their business. You can answer this question from data already queried or you can do a new Elasticsearch query.

Requirement #5: (15 points)

Plot on the Heatmap those facilites that serve children but failed inspections with high risk, and MICE DROPPINGS in the Violations.